Wednesday, January 24, 2007

Say What You Mean, and Mean What You Say (Part I)

The SQL NULL is defined to mean one very specific thing: Whatever this datum may be, it is currently unknown. The problem is that "unknown" does not encompass all reasons for a value being missing (and missing values was what NULLS were created for in the first place). Codd recognized this, and proposed two kinds of missing values in his relational model: missing, but not applicable, and missing, and also applicable [The article by Fabian Pascal, Understanding NULLS, a Contradiction in Terms, found here describes this in more detail, along with some arguments against having NULLS as part of SQL]. Here I want to examine how a Not Applicable situation differs from an Unknown situation.

Consider the following SQL Table (which is a variation of a table I once had to work with. And before anyone screams, this table has been stripped down to the bare essentials of the problem. And the table was actually one of several IMS segments translated verbatim into Oracle. No attempts at data normalization at this stage (that was reserved for the Data Mart):






















ID C1 C2 VAL
1 10 20 1000
2 40 40 1111
3 15 NULL 1234
4 33 NULL 2222

Again, this was part of a data warehouse of data from a legacy system. My job was to extract rows of interest for the Data Mart and then summarize them. The problem was to identify these rows.

C1 and C2 are "Code" columns. In the beginning, only C2 was in the system. Rows would be of interest if C2 was less than some supplied value, which for the purposes of this example will be taken to be 30. At a later time, C1 was added. The intention was for C1 to replace C2, and it eventually did, but not right away. At the time I came into the picture, some rows would have values for both C1 and C2, and some would only have values for C1. So the rules for what was of interest were as follows:

  1. If C1 and C2 both exist, extract the row if both are less than the supplied value.
  2. If only C1 exists, extract the row if it is less than the supplied value.
In our example, the rows of interest have the ID column equal to 1 and 3.

While giving the specifications in this manner might tempt a SQL to turn to PL/SQL, it is actually quite easy to express this as a single SQL predicate (using X as the supplied value):


((C1 <>


or, as I actually wrote it


((C1 <>


What this illustrates is that a very natural Truth Table for the AND operator in this three valued logic with Not Applicable is as follows (using NA for Not Applicable):







ANDTFNA
TTFT
FFFF
NATFNA


Which differs from the NULL Truth Table in the case of (T) AND (NA).


But there are more complications. More to follow.

No comments: