Saturday, February 3, 2007

Do You Know Where Your Cost Leaders Are?

I mentioned in an earlier post that allowing Nulls into a database can result in queries returning the "wrong" answer. Date has written about one example [1], one which he describes as "somewhat contrived." Here I present a different example, one which is not at all contrived (although, perhaps, of less weight).


Consider the following table (yet another variation on the famous EMP table):



EMPNO DEPTNO SALARY
1054 100 70000
1284 200 -null-
1523 100 80000
2472 300 40000
3232 200 60000
4421 300 -null-
5231 200 -null-
5236 300 65000
6211 200 65000


The nulls are present for a variation of the usual reason: we need to get the employees in the table so that they appear to be hired to satisfy one personnel function, but the paperwork for their salaries have not yet made it through the system, and so their salaries are not yet known.


Here is the question: Which department has the largest total salary?


The simplest approach is to run a query such as


SELECT DEPTNO, SUM(SALARY) FROM EMP
GROUP BY DEPTNO


Get


100 150000
200 125000
300 105000

in response, and conclude that the answer is department 100.


Do you agree? Would it help if I presented you with a more complete description of the table?


EMPNO DEPTNO SALARY CHECK ((SALARY IS NULL) OR (DEPTNO = 100
AND SALARY BETWEEN 50000 AND 90000)
OR(DEPTNO=200 AND SALARY BETWEEN 60000
AND 100000)
OR(DEPTNO=300 AND SALARY BETWEEN
65000 AND 80000))
1054 100 70000
1284 200 -null-
1523 100 80000
2472 300 40000
3232 200 60000
4421 300 -null-
5231 200 -null-
5236 300 65000
6211 200 65000


Each department has a salary range assigned. While we don't know exactly (yet) what employees 1284, 4421, and 5231 are going to be making, we know what their minimum will be. Therefore, a better list of the department salaries would be


100 exactly 150000
200 between 245000 and 325000
300 between 170000 and 190000


In other words, when all the evidence is in, department 200 will be found to have the greatest salary and 100 the least. All of the information needed to arrive at the correct result is present in the database, yet there is no simple way to ask the database for this result.


But there is a complication here also. More next time.



----------------------------------
[1] Much Ado About Nothing, at http://www.dbdebunk.com/page/page/1706814.htm