
Saturday, January 10, 2015

A Different Use for Analytic Functions

I subscribe to a couple of the LinkedIn Oracle discussion groups.  A recent question caught my eye, and I decided to investigate it further, mostly because it relates to a favorite activity of mine this time of year:  my (US) Federal Income Tax.

This is not the original statement of the problem, but it does describe the essence:

One is given the following table:

Income Limit    Tax Rate
      18150        10%
      73800        15%
     148850        25%
     226850        28%
     405100        33%
     457600        35%
over 457600       39.6%

The tax due is computed as follows:  For taxable incomes up to 18150, the tax is 10% of the taxable income.  For incomes between 18150 and 73800 the tax is 10% of 18150 plus 15% of the difference between the income and 18150, etc.

The question, then, is to design the table and state the query to calculate the tax due.

The comments showed a CASE statement being used. This struck me as unnecessary, so I set out to see whether the problem could be solved without one.

In the tax forms, the method is typically stated with a table such as this:

If Your Taxable     But    Your              Of the
Income is           Less    tax      Plus    Amount 
Greater than        Than     is              Over
      0            18150      0       10%        0
  18150            73800   1815       15%    18150
  73800           148850  10163       25%    73800
 148850           226850  28925       28%   148850 
 226850           405100  50765       33%   226850
 405100           457600 109587       35%   405100
 457600                  127962     39.6%   457600

It is easy to create and load such a table:

Create table tax_brackets (
  lowrange    number(9),
  highrange   number(9),
  taxbase     number(9),
  taxpct      number(4,4))
/

Insert into tax_brackets values
   (148850,  226850,  28925,  .28)
/
Insert into tax_brackets values
   (405100,  457600, 109587,   .35)
/
Insert into tax_brackets values
   (0, 18150, 0,  .10)
/
Insert into tax_brackets values
   (73800,  148850,  10163, .25)
/
Insert into tax_brackets values
   (18150,  73800,  1815,  .15)
/
Insert into tax_brackets values
   (457600,  999999999, 127962, .396)
/
Insert into tax_brackets values
   (226850,  405100, 50765, .33)
/

commit
/


Here I have used just the first four columns of my sample table, giving them the names LowRange, HighRange, TaxBase and TaxPct. Having done this, we can calculate the tax due for a given taxable Income with
 

Select TaxBase + TaxPct * (Income - LowRange)  TaxDue
from tax_Brackets
where Income > LowRange
  and Income <= HighRange
/
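
For example, with a taxable income of 100000 (written here as a literal; in practice Income would be a bind variable, a substitution variable, or a column):

Select TaxBase + TaxPct * (100000 - LowRange)  TaxDue
  from tax_brackets
 where 100000 > LowRange
   and 100000 <= HighRange
/

This lands in the 73800 to 148850 bracket, giving 10163 + .25 * 26200 = 16713.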

No CASEs to be found anywhere.

A few words of clarification here: the careful reader will have noticed that I appear to have loaded the data in random order. This was deliberate; I wanted to make sure that no accidental ordering was affecting my results. And I gave my 39.6% row a very large HighRange because I didn't want to complicate matters with a NULL value.

Of course, this table has some seemingly redundant columns. All the required information can be found in this table:
 

   BRACKET   TAX_RATE
---------- ----------
         0         .1
     18150        .15
     73800        .25
    148850        .28
    226850        .33
    405100        .35
    457600       .396

So, let's create it and see if we can use it instead:
 

Create table tax_bracket2 (
   bracket   number(9),
   tax_rate  number(4,4)  )
/

insert into tax_bracket2 values(   0,   .10)
/
insert into tax_bracket2 values(405100,   .35) 
/

insert into tax_bracket2 values(73800,   .25) 
/
insert into tax_bracket2 values(148850,   .28) 
/
insert into tax_bracket2 values(18150,   .15) 
/
insert into tax_bracket2 values(226850,   .33) 
/

insert into tax_bracket2 values(457600,   .396)
/

commit
/

We know how to solve this problem with the tax_brackets table. If we could write a query to derive the two missing columns, we could use it as a subquery in our original solution. The rest of this post describes how to do precisely that.

To clarify things, consider this query:
 

select Bracket  LowRange,
       '?'      HighRange,
       '?'      TaxBase,
       Tax_Rate TaxPct
  from tax_bracket2
 order by 1
/

  LOWRANGE H T     TAXPCT
---------- - - ----------
         0 ? ?         .1
     18150 ? ?        .15
     73800 ? ?        .25
    148850 ? ?        .28
    226850 ? ?        .33
    405100 ? ?        .35
    457600 ? ?       .396

7 rows selected.

 

Clearly (?) this is going to require Analytic Functions. Unfortunately, every time I need to use an analytic function, I have to start at the beginning. So let's start with the HighRange column. Reading the definition of the LEAD analytic function, we see

LAG/LEAD
Finding a value in a row a specified number of rows from a current row.

This looks promising. Here is the syntax:

{LAG | LEAD} ( value_expr [, offset] [, default] )
[RESPECT NULLS|IGNORE NULLS]
OVER ( [query_partition_clause] order_by_clause )

Let's try to plug in some values. For a given row, the value of its HighRange column is the value of the bracket column in the 'next' row. So we are going to use the LEAD function, and the expression is simply going to be 'bracket'. And since we want the next row, we will try an offset of '1'.

At this stage, our function is

LEAD(bracket,1) OVER (PARTITION BY??? ORDER BY ???)

I used to think of the 'PARTITION BY' clause as being a Group By clause for analytic functions. But, after doing my research on this, I've come to think of it more as a Break command in a SQL*Plus report. Using both approaches, it is clear that this is going to be one large grouping, so there should be no PARTITION BY clause. And we can order either by 'bracket' or 'TaxPct'. So, let's try:
 

  select bracket LowRange,
        LEAD(bracket,1) OVER (ORDER BY bracket) HighRange,
        '?'  TaxBase,
        tax_rate TaxPct
    from tax_bracket2
    order by tax_rate
/

  LOWRANGE  HIGHRANGE T     TAXPCT
---------- ---------- - ----------
         0      18150 ?         .1
     18150      73800 ?        .15
     73800     148850 ?        .25
    148850     226850 ?        .28
    226850     405100 ?        .33
    405100     457600 ?        .35
    457600          0 ?       .396

7 rows selected.

I see what went wrong. We are selecting all seven rows from tax_bracket2. The LEAD function is operating on each row: it has sorted all seven rows, and is returning the next 'bracket' value (which I have aliased to LowRange) as the HighRange value in the result set. But when it operates on the row with 'bracket'=457600, there is no next row, so it returns the default value. And since we haven't supplied a default value, it falls back to the documented default, which is NULL.

So, let's try this instead:
 select bracket LowRange,
        LEAD(bracket,1,999999999) OVER (ORDER BY bracket) HighRange,
        '?'  TaxBase,
        tax_rate TaxPct
     from tax_bracket2
     order by tax_rate
/

  LOWRANGE  HIGHRANGE T     TAXPCT
---------- ---------- - ----------
         0      18150 ?         .1
     18150      73800 ?        .15
     73800     148850 ?        .25
    148850     226850 ?        .28
    226850     405100 ?        .33
    405100     457600 ?        .35
    457600  999999999 ?       .396

7 rows selected.

Just for fun, let's see what happens if we make the second parameter 2:
 

  LOWRANGE  HIGHRANGE T     TAXPCT
---------- ---------- - ----------
         0      73800 ?         .1
     18150     148850 ?        .15
     73800     226850 ?        .25
    148850     405100 ?        .28
    226850     457600 ?        .33
    405100  999999999 ?        .35
    457600  999999999 ?       .396

7 rows selected.

And, just to spare you from having to try it, using '-1' results in an error.
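
If you really do want to look backwards, LAG (with a non-negative offset) is the tool. A quick sketch against our table:

select bracket,
       LAG(bracket,1,0) OVER (ORDER BY bracket) PrevBracket
  from tax_bracket2
/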

 Now, consider this query and result set from our original table:
 

select LowRange,
       HighRange,
       (HighRange - LowRange) * TaxPct,
       TaxPct
  from tax_brackets
 order by TaxPct
/

  LOWRANGE  HIGHRANGE (HIGHRANGE-LOWRANGE)*TAXPCT     TAXPCT
---------- ---------- --------------------------- ----------
         0      18150                        1815         .1
     18150      73800                      8347.5        .15
     73800     148850                     18762.5        .25
    148850     226850                       21840        .28
    226850     405100                     58822.5        .33
    405100     457600                       18375        .35
    457600  999999999                   395818790       .396

7 rows selected.

If we compute a running total of our third column, we get the column I am calling 'TaxBase'. And the SUM analytic function can compute this:
 


select LowRange,
       HighRange,
       nvl(sum((HighRange - LowRange)*TaxPct)
          OVER (order by TaxPct
                rows between unbounded preceding and 1 preceding),0) TaxBase,
       TaxPct
  from tax_brackets
 order by TaxPct
/

  LOWRANGE  HIGHRANGE    TAXBASE     TAXPCT
---------- ---------- ---------- ----------
         0      18150          0         .1
     18150      73800       1815        .15
     73800     148850    10162.5        .25
    148850     226850      28925        .28
    226850     405100      50765        .33
    405100     457600   109587.5        .35
    457600  999999999   127962.5       .396

7 rows selected.

And, since we know how to compute our 'HighRange' column from our simple tax_bracket2 table, we can derive the entire tax_brackets table thusly:
 

select LowRange,
       HighRange,
       nvl(sum((HighRange - LowRange)*TaxPct)
          OVER (order by TaxPct
                rows between unbounded preceding and 1 preceding),0) TaxBase,
       TaxPct
  from (
        select bracket LowRange,
               LEAD(bracket,1,9999999990) OVER (ORDER BY bracket) HighRange,
               tax_rate TaxPct
          from tax_bracket2
       )
 order by TaxPct
/

  LOWRANGE  HIGHRANGE    TAXBASE     TAXPCT
---------- ---------- ---------- ----------
         0      18150          0         .1
     18150      73800       1815        .15
     73800     148850    10162.5        .25
    148850     226850      28925        .28
    226850     405100      50765        .33
    405100     457600   109587.5        .35
    457600 9999999990   127962.5       .396

7 rows selected.

And we can write our final tax calculation query thusly:
 

Select TaxBase + TaxPct * (Income - LowRange)  TaxDue
 from (
    select LowRange,
           HighRange,
           nvl(sum((HighRange - LowRange)*TaxPct)
           OVER (order by TaxPct
                rows between unbounded preceding and 1 preceding),0) TaxBase,
           TaxPct
     from  (
            select bracket LowRange,
                  LEAD(bracket,1,9999999990) OVER (ORDER BY bracket) HighRange,
                  tax_rate TaxPct
              from tax_bracket2
            )
      order by TaxPct
        )
   where Income > LowRange
     and Income <= HighRange
/

Packaging this as a function is an exercise left to the reader.
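
For anyone who wants a head start, here is one possible sketch against the tax_brackets table (the function name and the handling of non-positive incomes are my own choices):

create or replace function tax_due (p_income in number)
  return number
is
  l_tax  number;
begin
  -- incomes of zero or less fall into no bracket; call that zero tax
  if p_income <= 0 then
    return 0;
  end if;
  select TaxBase + TaxPct * (p_income - LowRange)
    into l_tax
    from tax_brackets
   where p_income >  LowRange
     and p_income <= HighRange;
  return l_tax;
end tax_due;
/

After which select tax_due(100000) from dual should return the same 16713 as before.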

Wednesday, January 9, 2008

ORA-12705 "invalid or unknown NLS parameter value specified"

The "Official" explanation is that the NLS parameter value specified in an alter session statement is not valid or not implemented. Except that this is not quite the entire story. If we put some nonsense in an alter session statement, Oracle is able to figure that out:


SQL> alter session set dodah = hohah;
alter session set dodah = hohah
*
ERROR at line 1:
ORA-02248: invalid option for ALTER SESSION

O.K., let's try a bad option to a good parameter:

SQL> alter session set nls_date_format = dodah;
ERROR:
ORA-01821: date format not recognized

What we need is a valid NLS parameter (such as nls_date_format; we were on the right track there) given a bad value, while picking an NLS parameter which doesn't have its own validation (which appears to be any of them except the ones which refer to dates or times):

SQL> alter session set nls_territory = dodah;
ERROR:
ORA-12705: invalid or unknown NLS parameter value specified
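
For contrast, a value Oracle does recognize goes through without complaint (AMERICA has been a valid territory in every release I have used; your NLS environment may vary):

SQL> alter session set nls_territory = AMERICA;

Session altered.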

This describes how the ORA-12705 can arise in an existing session. Can one get this error during the logon process? The answer is, yes. Just set the NLS_LANG environment variable to an invalid value, and try to connect:

$ export NLS_LANG=dodah
$ echo $NLS_LANG
dodah
$ sqlplus psinger1@oradev

SQL*Plus: Release 9.2.0.8.0 - Production on Wed Oct 24 07:48:45 2007

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

Enter password:
ERROR:
ORA-12705: invalid or unknown NLS parameter value specified

And NLS_LANG appears to be the only NLS parameter which can be set as an environment variable. So, in summary: if you get an ORA-12705 and you haven't just made an obvious mistake with an ALTER SESSION statement, check your NLS_LANG setting.

Saturday, February 3, 2007

Do You Know Where Your Cost Leaders Are?

I mentioned in an earlier post that allowing Nulls into a database can result in queries returning the "wrong" answer. Date has written about one example [1], one which he describes as "somewhat contrived." Here I present a different example, one which is not at all contrived (although, perhaps, of less weight).


Consider the following table (yet another variation on the famous EMP table):



EMPNO  DEPTNO  SALARY
 1054     100   70000
 1284     200  -null-
 1523     100   80000
 2472     300   40000
 3232     200   60000
 4421     300  -null-
 5231     200  -null-
 5236     300   65000
 6211     200   65000


The nulls are present for a variation of the usual reason: we need to get the employees into the table so that they appear as hired, satisfying one personnel function, but the paperwork for their salaries has not yet made it through the system, and so their salaries are not yet known.


Here is the question: Which department has the largest total salary?


The simplest approach is to run a query such as


SELECT DEPTNO, SUM(SALARY) FROM EMP
GROUP BY DEPTNO


Get


100 150000
200 125000
300 105000

in response, and conclude that the answer is department 100.


Do you agree? Would it help if I presented you with a more complete description of the table?


CHECK ((SALARY IS NULL)
    OR (DEPTNO = 100 AND SALARY BETWEEN 50000 AND  90000)
    OR (DEPTNO = 200 AND SALARY BETWEEN 60000 AND 100000)
    OR (DEPTNO = 300 AND SALARY BETWEEN 65000 AND  80000))

EMPNO  DEPTNO  SALARY
 1054     100   70000
 1284     200  -null-
 1523     100   80000
 2472     300   40000
 3232     200   60000
 4421     300  -null-
 5231     200  -null-
 5236     300   65000
 6211     200   65000


Each department has a salary range assigned. While we don't know exactly (yet) what employees 1284, 4421, and 5231 are going to be making, we know what their minimum will be. Therefore, a better list of the department salaries would be


100  exactly 150000
200  between 245000 and 325000
300  between 170000 and 185000


In other words, when all the evidence is in, department 200 will be found to have the greatest salary and 100 the least. All of the information needed to arrive at the correct result is present in the database, yet there is no simple way to ask the database for this result.
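
The closest one can come, as far as I can tell, is to restate the constraint's ranges by hand and fold them in with NVL. A sketch (the inline view of ranges is mine, copied from the CHECK constraint):

select e.deptno,
       sum(nvl(e.salary, r.minsal)) lowest_possible,
       sum(nvl(e.salary, r.maxsal)) highest_possible
  from emp e,
       (select 100 deptno, 50000 minsal,  90000 maxsal from dual union all
        select 200,        60000,        100000        from dual union all
        select 300,        65000,         80000        from dual) r
 where e.deptno = r.deptno
 group by e.deptno
/

Which only underlines the point: the ranges already live in the constraint, but SQL gives us no way to reference them there, so we have to write them out a second time.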


But there is a complication here also. More next time.



----------------------------------
[1] Much Ado About Nothing, at http://www.dbdebunk.com/page/page/1706814.htm

Wednesday, January 31, 2007

Say What You Mean, and Mean What You Say (Part II)

Last time I rambled on, trying to set some context for how to interpret a 3rd (or 4th) truth value of "not applicable". What I was hoping to do was to make plausible the following truth table for its conjunction:


P^Q | T | F | NA
----------------
T | T | F | T
----------------
F | F | F | F
----------------
NA | T | F | NA


This is the plausibility argument: if a predicate does not apply, then it can be ignored. Thus, T^NA becomes simply T, since the second predicate can be ignored. Obviously, one could continue in this vein, and construct a truth table for this logic's disjunction:



PvQ | T | F | NA
----------------
T | T | T | T
----------------
F | T | F | F
----------------
NA | T | F | NA


But, if we do this, we find that the distributive law does not hold. In particular, consider


P^(QvR) = (P^Q)v(P^R)


for the truth values P=True, Q=False, and R=NA (R=NULL on the SQL side). One finds that the left and right sides both evaluate to NULL in standard SQL, but the left side evaluates to False and the right to True in my hypothetical 3 valued logic. So, this is not going to become part of SQL anytime soon.
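
The standard SQL half of that claim is easy to check from a SQL*Plus prompt. Here is one way (my own mapping of the three truth values onto strings, with to_number(null) standing in for R):

select case when (1=1) and ((1=0) or (to_number(null) = 1)) then 'TRUE'
            when not ((1=1) and ((1=0) or (to_number(null) = 1))) then 'FALSE'
            else 'UNKNOWN'
       end lhs,
       case when ((1=1) and (1=0)) or ((1=1) and (to_number(null) = 1)) then 'TRUE'
            when not (((1=1) and (1=0)) or ((1=1) and (to_number(null) = 1))) then 'FALSE'
            else 'UNKNOWN'
       end rhs
  from dual
/

Both columns come back UNKNOWN.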


To sum up, what is the point of all this? Here is the short answer:


  • Values can be missing for at least two reasons: the proper value is currently unknown, and/or the value does not apply.

  • The SQL Null is tailored for the "unknown" reason.

  • Using a Null to mean "does not apply" will produce confusing results at best, and downright wrong results at worst.

Wednesday, January 24, 2007

Say What You Mean, and Mean What You Say (Part I)

The SQL NULL is defined to mean one very specific thing: whatever this datum may be, it is currently unknown. The problem is that "unknown" does not encompass all reasons for a value being missing (and missing values are what NULLs were created for in the first place). Codd recognized this, and proposed two kinds of missing values in his relational model: missing but applicable, and missing and inapplicable. (The article by Fabian Pascal, "Understanding NULLs: A Contradiction in Terms", describes this in more detail, along with some arguments against having NULLs as part of SQL.) Here I want to examine how a Not Applicable situation differs from an Unknown situation.

Consider the following SQL table, which is a variation of a table I once had to work with. (Before anyone screams: the table has been stripped down to the bare essentials of the problem. It was actually one of several IMS segments translated verbatim into Oracle, with no attempt at data normalization at this stage; that was reserved for the Data Mart.)


ID  C1  C2    VAL
 1  10  20    1000
 2  40  40    1111
 3  15  NULL  1234
 4  33  NULL  2222

Again, this was part of a data warehouse of data from a legacy system. My job was to extract rows of interest for the Data Mart and then summarize them. The problem was to identify these rows.

C1 and C2 are "Code" columns. In the beginning, only C2 was in the system. Rows would be of interest if C2 was less than some supplied value, which for the purposes of this example will be taken to be 30. At a later time, C1 was added. The intention was for C1 to replace C2, and it eventually did, but not right away. At the time I came into the picture, some rows would have values for both C1 and C2, and some would only have values for C1. So the rules for what was of interest were as follows:

  1. If C1 and C2 both exist, extract the row if both are less than the supplied value.
  2. If only C1 exists, extract the row if it is less than the supplied value.
In our example, the rows of interest have the ID column equal to 1 and 3.

While giving the specifications in this manner might tempt a SQL developer to turn to PL/SQL, it is actually quite easy to express this as a single SQL predicate (using X as the supplied value):


((C1 < X) AND (C2 < X OR C2 IS NULL))


or, as I actually wrote it


((C1 < X) AND (NVL(C2, C1) < X))
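
To see the predicate in action with X = 30 (the table name and data here are my reconstruction of the example above):

create table legacy_seg (id number, c1 number, c2 number, val number);

insert into legacy_seg values (1, 10, 20,   1000);
insert into legacy_seg values (2, 40, 40,   1111);
insert into legacy_seg values (3, 15, null, 1234);
insert into legacy_seg values (4, 33, null, 2222);

select id
  from legacy_seg
 where c1 < 30
   and (c2 < 30 or c2 is null);

This returns IDs 1 and 3, as promised.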


What this illustrates is that a very natural Truth Table for the AND operator in this three valued logic with Not Applicable is as follows (using NA for Not Applicable):


AND | T | F | NA
----------------
T   | T | F | T
----------------
F   | F | F | F
----------------
NA  | T | F | NA


Which differs from the NULL Truth Table in the cases of (T) AND (NA) and (NA) AND (T).


But there are more complications. More to follow.

Thursday, January 18, 2007

Outer Joins and Nulls

I had planned to post something else, then I found this in AskTom:

create table test1(aa varchar(10),bb varchar(10));
create table test2(cc varchar(10),aa varchar(10),dd varchar(10));
insert into test1 values('11','qaz');
insert into test1 values('22','wsx');
insert into test1 values('33','edc');
commit;
select a.bb from test1 a,test2 b
where a.aa=b.aa(+)
and a.bb='qaz'
and b.dd(+)='';

Run in 9i, the poster obtained the result 'qaz'.

However, when this was run in a 10g instance, no rows were returned. The poster (this was a comment to a prior question, so there was no reason to expect Tom to post a response) asked two obvious questions:

1) Why is this a bug?, and

2) How can this be?

The answer to the first is obvious: if the same query on the same tables gives two different results in two different database versions, there is probably a bug (the weasel word will be exposed soon). The second is more involved. I am posting the details here; I am posting a summary to AskTom.

Let us take this query and try to rewrite it using ANSI syntax. We can get this far:

select a.bb
from test1 a
left outer join test2 b
on a.aa = b.aa

And at this point things become ambiguous. What about that b.dd=''? Is it supposed to be applied before or after the outer join? In other words, should the query be translated into the ANSI syntax as

select a.bb
from test1 a
left outer join test2 b
on a.aa = b.aa
and b.dd = ''
where a.bb = 'qaz'

or as

select a.bb
from test1 a
left outer join test2 b
on a.aa = b.aa
where a.bb = 'qaz'
and b.dd = ''

In the first case, you proceed as follows:

First, take this table:

AA | BB
--------------------
11 | qaz
22 | wsx
33 | edc


And left outer join it to this table:

CC | AA | DD
-------------------------

(You applied a predicate which can return no rows to an empty table; you end up with an empty table).
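
(The underlying reason the predicate can never be satisfied: in Oracle, a zero-length string is treated as NULL, and no comparison with NULL is ever true. A one-line check:

select 'empty string is null' answer from dual where '' is null;

returns a row, while the same query with the predicate '' = '' returns none.)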

The result is this intermediate table:

BB
------
qaz
wsx
edc

Finally, you select the row where BB = 'qaz'.


In the second case, you first do the join, and end up with the same intermediate table. You then apply both predicates, one of which can return no rows. Thus you end up with no rows.

I have not been able to find any statement by Oracle as to which assumption is made with the old syntax (and the optimizer may be capable of either). My theory is that 9i does things one way, and 10g can sometimes do things the other way.

Sunday, January 14, 2007

The Adventure Begins

Here is my first post, and I have been spending the last week trying to come up
with something sufficiently profound. Since I have not been able to do that, I
have decided to post something simple instead. Future weeks will expand on this.

The first newsgroups I read on a regular basis were comp.databases.oracle.server
and comp.databases.theory. In the latter, I noticed some comments regarding
nulls seemed to occur on a regular basis, and I wondered what was behind them. A
little over a year ago, I decided to research this and find out what the nature
of this controversy was. I now think that I understand where the "No Nulls"
group is coming from.

So, I am going to start with a (very) brief summary of the opposing arguments.
Subsequent posts will elaborate on them (if some other Oracle related notion
doesn't distract me).

To begin with, I need some exposition of SQL nulls. Rather than attempt to
improve upon what many have done before, I will just cite what I regard as the
best article on the subject:

http://www.oracle.com/technology/oramag/oracle/05-jul/o45sql.html

In the late 1960's, Codd and others developed their relational algebra, which
became the basis for the theory behind modern relational data modeling. Whenever
you see someone speak of tuples rather than tables, relations rather than
statements, or types rather than domains, it is in the context of this algebra
that they are speaking. At the time, there was some hope that should relational
databases become practical, this algebra would be their language. Instead, SQL
became the standard interface. And SQL, while close to relational algebra,
differs in a number of essential aspects.

One of these aspects is the existence of nulls, more precisely, how they are
implemented. Nulls turn the logic of SQL into a three valued logic. Relational
algebra is based on the traditional two valued logic. And so nulls become one of
the things wrong with SQL.

But there is a bigger problem. Nulls, as defined by SQL, result in some
unexpected and surprising traps. For example, a NOT IN subquery against a
nullable column (whether or not the column actually contains any nulls) will
prevent the optimizer from using an anti-join to optimize the query. And the
existence of nulls can even cause a query to return a logically wrong result set.
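
That last trap takes only a line to demonstrate (the IN-list here stands in
for any subquery over a nullable column):

select 'found' from dual where 1 not in (2, null);

No rows come back, even though 1 is plainly not 2: the comparison 1 <> NULL
is unknown, so the NOT IN predicate never becomes true.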

But on the other hand, it was none other than Codd himself who introduced nulls
into the relational database realm. His 3rd feature of relational databases was
a "systematic way of handling missing information." And while he proposed was
not exactly what SQL developed, he did remain a proponent of nulls all his life.

Well, this has gotten me started. Tomorrow, I should have more focus.