Wednesday, January 31, 2007

Say What You Mean, and Mean What You Say (Part II)

Last time I rambled on, trying to set some context for how to interpret a 3rd (or 4th) truth value of "not applicable". What I was hoping to do was to make plausible the following truth table for it's conjunction:


P^Q | T | F | NA
----------------
T | T | F | T
----------------
F | F | F | F
----------------
NA | T | F | NA


This is the plausibility argument: if a predicate does not apply, then it can be ignored. Thus, T^NA becomes simply T, since the second predicate can be ignored. Obviously, one could continue in this vein, and construct a truth table for this logic's disjunction:



PvQ | T | F | NA
----------------
T | T | T | T
----------------
F | T | F | F
----------------
NA | T | F | NA


But, if we do this, we find that the distributive law does not hold. In particular, if one considers:



P^(QvR) = (P^Q)v(P^R)


for the truth values P=True, Q=False, and R=NULL, One finds that the left and right side both evaluate to NULL in standard SQL, but the left side evaluates to False and the right to True in my hypothetical 3 valued logic. So, this is not going to become part of SQL anytime soon.


To sum up, what is the point of all this? Here is the short answer:


  • Values can be missing for at least two reasons: the proper value is currently unknown, and/or the value does not apply.

  • The SQL Null is tailored for the "unknown" reason.

  • Using a Null to mean "does not apply" will produce confusing results at best, and downright wrong results at worst.

Wednesday, January 24, 2007

Say What You Mean, and Mean What You Say (Part I)

The SQL NULL is defined to mean one very specific thing: Whatever this datum may be, it is currently unknown. The problem is that "unknown" does not encompass all reasons for a value being missing (and missing values was what NULLS were created for in the first place). Codd recognized this, and proposed two kinds of missing values in his relational model: missing, but not applicable, and missing, and also applicable [The article by Fabian Pascal, Understanding NULLS, a Contradiction in Terms, found here describes this in more detail, along with some arguments against having NULLS as part of SQL]. Here I want to examine how a Not Applicable situation differs from an Unknown situation.

Consider the following SQL Table (which is a variation of a table I once had to work with. And before anyone screams, this table has been stripped down to the bare essentials of the problem. And the table was actually one of several IMS segments translated verbatim into Oracle. No attempts at data normalization at this stage (that was reserved for the Data Mart):






















ID C1 C2 VAL
1 10 20 1000
2 40 40 1111
3 15 NULL 1234
4 33 NULL 2222

Again, this was part of a data warehouse of data from a legacy system. My job was to extract rows of interest for the Data Mart and then summarize them. The problem was to identify these rows.

C1 and C2 are "Code" columns. In the beginning, only C2 was in the system. Rows would be of interest if C2 was less than some supplied value, which for the purposes of this example will be taken to be 30. At a later time, C1 was added. The intention was for C1 to replace C2, and it eventually did, but not right away. At the time I came into the picture, some rows would have values for both C1 and C2, and some would only have values for C1. So the rules for what was of interest were as follows:

  1. If C1 and C2 both exist, extract the row if both are less than the supplied value.
  2. If only C1 exists, extract the row if it is less than the supplied value.
In our example, the rows of interest have the ID column equal to 1 and 3.

While giving the specifications in this manner might tempt a SQL to turn to PL/SQL, it is actually quite easy to express this as a single SQL predicate (using X as the supplied value):


((C1 <>


or, as I actually wrote it


((C1 <>


What this illustrates is that a very natural Truth Table for the AND operator in this three valued logic with Not Applicable is as follows (using NA for Not Applicable):







ANDTFNA
TTFT
FFFF
NATFNA


Which differs from the NULL Truth Table in the case of (T) AND (NA).


But there are more complications. More to follow.

Saturday, January 20, 2007

More on Empty Strings

There is one feature in the last post (Outer Joins and Nulls) which deserves a closer look. But first,

Consider the string 'AB'. This is far from an unknown string. We know exactly how many characters long it is. We know exactly what characters they are. About the only thing we do not know about it is why anyone would bring it up in the first place. But, to continue ...

Consider the string 'A'. Exactly the same considerations apply to it as to 'AB'. Moreover, if we were to create two variables, V1 and V2, and assign the string 'AB' to each, we would regard any computer application in which

(V1 equals V2)

was false as being very wrong headed. [I write the above as (V1 equals V2) to get around the issue of whether the operator for string equality should be '=', '==', or 'eq' .]

Proceeding to the next step, if 'AB' is well known, and 'A' is well known, then '' should be regarded as well known. We know that it is a string of zero length, we know that it contains no characters, and we should expect '' to equal ''. Indeed, at the end of this post, I have included a listing from a SUSE Linux session where this is shown to be the case there).

What does any of this have to do with the last post?

In that post, the critical statement was the line " and b.dd = '' ". What was left unsaid, to be understood by the reader, is that Oracle interprets this statement as

and b.dd = NULL

which always evaluates to False. How this happens is worth a deeper look.

First of all, the beginning of this post demonstrated that the empty string and the SQL NULL are two entirely different things. A NULL can take any value (and thus it's exact value is unknown). An empty string has a well know, well defined value. A NULL can have any length (and thus it's precise length is unknown). An empty string is exactly zero characters long. A NULL can be equal to anything, and thus is equal to nothing. An empty string is only equal to another empty string.

Unfortunately, Oracle long ago adopted the kludge of using the empty string to represent the SQL NULL internally. Oracle realizes this, and has always included a statement in their documentation that this is non-standard behavior, that it may change without warning, and that no one should rely on it.

And now for a personal comment. After a few months with Oracle 5, I had nothing to do with Oracle for a few years. When I came back, it was to release 7.1.14 (stuck in my mind). Then I took a job with a company which was in the process of converting (it happened the 3rd week I was there) from release 6 to release 7.0.x. Everything went fine, and then a few months later we converted to 7.1. Suddenly, my boss wanted to know what was wrong with the new release. All his PL/SQL, which was full of constructions such as

Where column_A = ''

Instead of

Where column_A IS NULL

were failing. I was of no help, as I had no idea how they could have possibly worked in the first place.

What happened in release 7.1 was that '' now became equivalent to NULL. And in the process, it became impossible to store or test for an empty string. And this is more than just an academic exercise. Search most Oracle forums, and you will find someone trying to port an application where empty strings are significant from some other database, and finding out that their existing, SQL-92 compliant code does not work.

Notice that I have always been careful to refer to '' as the empty string and not the null string. The reason should be obvious; I want to emphasize that the empty string is not a SQL NULL. Not everyone is a careful as I am, and one can find places where someone will talk as if it is obvious that a null string is a NULL value. But no reader of this should ever make that mistake.

--------------------------
Here is that Linux Shell Script Listing and Output I promised:
$ cat bigtest.sh

###Set Up Some Variables

V1='AA'
V1A='AB'
V2='AB'
V3='A'
V4='A'
V5=''
V6=''

###See if 'AA' is identical to 'AB'
if [ $V1 == $V2 ]
then echo "String \$V1 equals string \$V2"
else echo "String \$V1 is not equal to string \$V2"
fi

###See if 'AA' is identical to 'AA'
if [ $V1A == $V2 ]
then echo "String \$V1A equals string \$V2"
else echo "String \$V1A is not equal to string \$V2"
fi

###See if 'A' is identical to 'A'
if [ $V3 == $V4 ]
then echo "String \$V3 equals string \$V4"
else echo "String \$V3 is not equal to string \$V4"
fi

###Finally, See if '' is identical to ''
if [ $V5 == $V6 ]
then echo "String \$V5 equals string \$V6"
else echo "String \$V5 is not equal to string \$V6"
fi

l0=${#V5}
echo "String \$V5 is $l0 bytes long"

l1=${#V3}
echo "String \$V3 is $l1 bytes long"

l2=${#V1}
echo "String \$V1 is $l2 bytes long"
$



$ ./bigtest.sh
String $V1 is not equal to string $V2
String $V1A equals string $V2
String $V3 equals string $V4
String $V5 equals string $V6
String $V5 is 0 bytes long
String $V3 is 1 bytes long
String $V1 is 2 bytes long
$

Thursday, January 18, 2007

Outer Joins and Nulls

I had planned to post something else, then I found this in AskTom:

create table test1(aa varchar(10),bb varchar(10));
create table test2(cc varchar(10),aa varchar(10),dd varchar(10));
insert into test1 values('11','qaz');
insert into test1 values('22','wsx');
insert into test1 values('33','edc');
commit; .
select a.bb from test1 a,test2 b
where a.aa=b.aa(+)
and a.bb='qaz'
and b.dd(+)='';

Run in 9i, the poster obtained the result 'qaz'.

However, when this was run in a 10g instance, no rows were returned. The poster (this was a comment to an prior question, so there reason to expect Tom to have posted his response) asked two obvious questions:

1) Why is this a bug?, and

2) How can this be?

The answer to the first is obvious: If the same query on the same tables gives two different results in two different database versions, there is probably a bug (the weasel will be exposed soon). The second is more involved. I am posting the details here; I am posting a summary to AskTom.

Let us take this query, and try to rewrite it using Ansi syntax. We can get this far:

select a.bb
from test1 a
left outer join test2 b
on a.aa = b.aa

And at this point things become ambiguous. That b.dd=''. Is that supposed to be applied before or after the outer join? In other words, should this get translated into the Ansi syntax as

select a.bb
from test1 a
left outer join test2 b
on a.aa = b.aa
and b.dd = ''
where a.bb = 'qaz'

or as

select a.bb
from test1 a
left outer join test2 b
on a.aa = b.aa
where a.bb = 'qaz'
and b.dd = ''

In the first case, you proceed as follows:

1.a) Take this table:

AA | BB
--------------------
11 | qaz
22 | wsx
33 | edc


And left outer join it to this table:

CC | AA | DD
-------------------------

(You applied a predicate which can return no rows to an empty table; you end up with an empty table).

The result is this intermediate table:

BB
------
qaz
wsx
edc

Finally, you select the row where BB = 'qaz'.


In the second case, you first do the join, and end up with the same intermediate table. You then apply both predicates, one of which can return no rows. Thus you end up with no rows.

I have not been able to find any statement by Oracle as to which assumption is made with the old syntax (and the optimizer may be capable of either). My theory is that 9i does things one way, and 10g can sometimes do things the other way.

Sunday, January 14, 2007

The Adventure Begins

Here is my first post, and I have been spending the last week trying to come up
with something sufficiently profound. Since I have not been able to do that, I
have decided to post something simple instead. Future weeks will expand on this.

The first newsgroups I read on a regular basis were comp.databases.oracle.server
and comp.databases.theory. In the latter, I noticed some comments regarding
nulls seemed to occur on a regular basis, and I wondered what was behind them. A
little over a year ago, I decided to research this and find out what the nature
of this controversy was. I now think that I understand where the "No Nulls"
group is coming from.

So, I am going to start with a (very) brief summary of the opposing arguments.
Subsequent posts will elaborate on them (if some other Oracle related notion
doesn't distract me).

To begin with, I need some exposition of SQL nulls. Rather than attempt to
improve upon what many have done before, I will just cite what I regard as the
best article on the subject:

http://www.oracle.com/technology/oramag/oracle/05-jul/o45sql.html

In the late 1960's, Codd and others developed their relational algebra, which
became the basis for the theory behind modern relational data modeling. Whenever
you see someone speak of tuples rather than tables, relations rather than
statements, or types rather than domains, it is in the context of this algebra
that they are speaking. At the time, there was some hope that should relational
databases become practical, this algebra would be their language. Instead, SQL
became the standard interface. And SQL, while close to relational algebra,
differs in a number of essential aspects.

One of these aspects is the existence of nulls, more precisely, how they are
implemented. Nulls turn the logic of SQL into a three valued logic. Relational
algebra is based on the traditional two valued logic. And so nulls become one of
the things wrong with SQL.

But there is a bigger problem. Nulls, as defined by SQL, result in some
unexpected and surprising traps. For example, a column being nullable (whether or
not it actually contains any nulls) will prevent the optimizer from using an
Anti-Join to optimize a query. And the existence of nulls can even result in a
logically wrong result set.

But on the other hand, it was none other than Codd himself who introduced nulls
into the relational database realm. His 3rd feature of relational databases was
a "systematic way of handling missing information." And while he proposed was
not exactly what SQL developed, he did remain a proponent of nulls all his life.

Well, this has gotten me started. Tomorrow, I should have more focus.