Saturday, January 10, 2015

A Different Use for Analytic Functions

I subscribe to a couple of the LinkedIn Oracle discussion groups.  A recent question caught my eye, and I decided to investigate it further, mostly because it relates to a favorite activity of mine this time of year:  my (US) Federal Income Tax.

This is not the original statement of the problem, but it does describe the essence:

One is given the following table:

Income Limit    Tax Rate
18150              10%
73800              15%
148850             25%
226850             28%
405100             33%
457600             35%
over 457600      39.6%

The tax due is computed as follows:  For taxable incomes up to 18150, the tax is 10% of the taxable income.  For incomes between 18150 and 73800 the tax is 10% of 18150 plus 15% of the difference between the income and 18150, etc.

The question then, is, is to design the table and state the query to calculate the tax due.

The comments had a CASE statement being used.  This struck me as unnecessary, so I set out to see if one could solve this problem without one.

In the tax forms, the method is typically stated with a table such as this:

If Your Taxable     But    Your              Of the
Taxable Income is   Less    tax      Plus    Amount 
Greater than        Than     is              Over
      0            18150      0       10%        0
  18150            73800   1815       15%    18150
  73800           148850  10163       25%    73800
 148850           226850  28925       28%   148850 
 226850           405100  50765       33%   226850
 405100           457600 109587       35%   405100
 457600                  127962     39.6%   457600

It is easy to create and load such a table:

Create table tax_brackets (
  lowrange    number(9),
  highrange   number(9),
  taxbase     number(9),
  taxpct      number(4,4))
/

Insert into tax_brackets values
   (148850,  226850,  28925,  .28)
/
Insert into tax_brackets values
   (405100,  457600, 109587,   .35)
/
Insert into tax_brackets values
   (0, 18150, 0,  .10)
/
Insert into tax_brackets values
   (73800,  148850,  10163, .25)
/
Insert into tax_brackets values
   (18150,  73800,  1815,  .15)
/
Insert into tax_brackets values
   (457600,  999999999, 127962, .396)
/
Insert into tax_brackets values
   (226850,  405100, 50765, .33)
/

commit
/


Where I have just used the first four columns of my sample table, and given the columns the names LowRange, HighRange, TaxBase and TaxPct. Having done this, we can calculate the tax due for a given taxable Income with
 

Select TaxBase + TaxPct * (Income - LowRange)  TaxDue
from tax_Brackets
where Income > LowRange
  and Income <= HighRange
/

No CASEs to be found anywhere.

A few words of clarification here: The reader who is carefully reading will have noticed that I have appeared to have loaded the data in random order. This was deliberate. I wanted to make sure that there was no accidental ordering affecting my results. And I included a very large HighRange for my 39.6% row. This was because I didn't want to complicate matters with an almost value.

Of course, this table has some seemingly redundant columns. All the required information can be found in this table:
 

   BRACKET   TAX_RATE
---------- ----------
         0         .1
     18150        .15
     73800        .25
    148850        .28
    226850        .33
    405100        .35
    457600       .396

So, let's create it and see if we can use it instead:
 

Create table tax_bracket2 (
   bracket   number(9),
   tax_rate  number(4,4)  )
/

insert into tax_bracket2 values(   0,   .10)
/
insert into tax_bracket2 values(405100,   .35) 
/

insert into tax_bracket2 values(73800,   .25) 
/
insert into tax_bracket2 values(148850,   .28) 
/
insert into tax_bracket2 values(18150,   .15) 
/
insert into tax_bracket2 values(226850,   .33) 
/

insert into tax_bracket2 values(457600,   .396)
/

commit
/

We know how to solve this problem with the tax_bracket table. If we could write a query to drive the two missing columns, we could use it as a subquery in our original solution. The rest of this post describes how to do precisely that.

To clarify things, consider this query:
 

 select Bracket  LowRange,
            '?'      HighRange,
            '?'      TaxBase,
            Tax_Rate TaxPct
     from tax_bracket2
     order by 1;
/

  LOWRANGE H T     TAXPCT
---------- - - ----------
         0 ? ?         .1
     18150 ? ?        .15
     73800 ? ?        .25
    148850 ? ?        .28
    226850 ? ?        .33
    405100 ? ?        .35
    457600 ? ?       .396

7 rows selected.

 

Clearly (?) this is going to require Analytic Functions. Unfortunately, everytime I need to use an analytic function, I have to start at the beginning. So let's start with the HighRange Column. Reading the definition of the LEAD Analytic Function we see

LAG/LEAD
Finding a value in a row a specified number of rows from a current row.

This looks promising. Here is the syntax:

{LAG | LEAD} ( value_expr [, offset] [, default] )
[RESPECT NULLS|IGNORE NULLS]
OVER ( [query_partition_clause] order_by_clause )

Let's try and plug in some values. For a given row, the value of its HighRange function is the value of the bracket column in the 'next' row. So, we are going to use the LEAD function, and the expression is going to simply be 'bracket'. And since we want the next row, we are going to try an offset of '1'.

At this stage, our function is

LEAD(bracket,1) OVER (PARTITION BY??? ORDER BY ???)

I used to think of the 'PARTITION BY' clause as being a Group By clause for analytic functions. But, after doing my research on this, I've come to think of it more as a Break command in a SQL*Plus report. Using both approaches, it is clear that this is going to be one large grouping, so there should be no PARTITION BY clause. And we can order either by 'bracket' or 'TaxPct'. So, let's try:
 

  select bracket LowRange,
        LEAD(bracket,1) OVER (ORDER BY bracket) HighRange,
        '?'  TaxBase,
        tax_rate TaxPct
    from tax_bracket2
    order by tax_rate
/

  LOWRANGE  HIGHRANGE T     TAXPCT
---------- ---------- - ----------
         0      18150 ?         .1
     18150      73800 ?        .15
     73800     148850 ?        .25
    148850     226850 ?        .28
    226850     405100 ?        .33
    405100     457600 ?        .35
    457600          0 ?       .396

7 rows selected.

I see what went wrong. We are selecting all seven rows from tax_bracket2. The LEAD function is operating on each row. It has sorted all seven rows, and is returning the next 'bracket' value (which I have aliased to LowRange) as the HighRange value in the result set. But, when it operates on the row with 'bracket'=457600, there is no next row. So, it returns the default value. And since we haven't supplied a default value, it defaults to 0.

So, let's try this instead:
 select bracket LowRange,
        LEAD(bracket,1,999999999) OVER (ORDER BY bracket) HighRange,
        '?'  TaxBase,
        tax_rate TaxPct
     from tax_bracket2
     order by tax_rate
/

  LOWRANGE  HIGHRANGE T     TAXPCT
---------- ---------- - ----------
         0      18150 ?         .1
     18150      73800 ?        .15
     73800     148850 ?        .25
    148850     226850 ?        .28
    226850     405100 ?        .33
    405100     457600 ?        .35
    457600  999999999 ?       .396

7 rows selected.

Just for fun, let's see what happens if we make the second parameter = 2,
 

  LOWRANGE  HIGHRANGE T     TAXPCT
---------- ---------- - ----------
         0      73800 ?         .1
     18150     148850 ?        .15
     73800     226850 ?        .25
    148850     405100 ?        .28
    226850     457600 ?        .33
    405100  999999999 ?        .35
    457600  999999999 ?       .396

7 rows selected.

And, just to spare you from having to try it, using '-1' results in an error.

 Now, consider this query and result set from our original table:
 

  1  select LowRange,
  2         HighRange,
  3         (HighRange - LowRange) * TaxPct,
  4         TaxPct
  5  from tax_brackets
  6* order by TaxPct
 /

  LOWRANGE  HIGHRANGE (HIGHRANGE-LOWRANGE)*TAXPCT     TAXPCT
---------- ---------- --------------------------- ----------
         0      18150                        1815         .1
     18150      73800                      8347.5        .15
     73800     148850                     18762.5        .25
    148850     226850                       21840        .28
    226850     405100                     58822.5        .33
    405100     457600                       18375        .35
    457600  999999999                   395818790       .396

7 rows selected.

If we compute a running total of our third column, we get the column I am calling 'TaxBase'. And the SUM analytic function can compute this:
 


  1  select LowRange,
  2         HighRange,
  3         nvl(sum((HighRange - LowRange)*TaxPct)
  4            OVER (order by TaxPct
  5                  rows between unbounded preceding and 1 preceding),0) TaxBase,
  6         TaxPct
  7    from tax_brackets
  8*   order by TaxPct
/

  LOWRANGE  HIGHRANGE    TAXBASE     TAXPCT
---------- ---------- ---------- ----------
         0      18150          0         .1
     18150      73800       1815        .15
     73800     148850    10162.5        .25
    148850     226850      28925        .28
    226850     405100      50765        .33
    405100     457600   109587.5        .35
    457600  999999999   127962.5       .396

7 rows selected.

And, since we know how to compute our 'HighRange' column from our simple tax_bracket2 table, we can derive the entire tax_brackets table thusly:
 

  1  select LowRange,
  2         HighRange,
  3         nvl(sum((HighRange - LowRange)*TaxPct)
  4            OVER (order by TaxPct
  5                  rows between unbounded preceding and 1 preceding),0) TaxBase,
  6         TaxPct
  7    from  (
  8           select bracket LowRange,
  9              LEAD(bracket,1,9999999990) OVER (ORDER BY bracket) HighRange,
 10              tax_rate TaxPct
 11           from tax_bracket2
 12          )
 13*   order by TaxPct

  LOWRANGE  HIGHRANGE    TAXBASE     TAXPCT
---------- ---------- ---------- ----------
         0      18150          0         .1
     18150      73800       1815        .15
     73800     148850    10162.5        .25
    148850     226850      28925        .28
    226850     405100      50765        .33
    405100     457600   109587.5        .35
    457600 9999999990   127962.5       .396

7 rows selected.

And we can write our final tax calculation query thusly:
 

Select TaxBase + TaxPct * (Income - LowRange)  TaxDue
 from (
    select LowRange,
           HighRange,
           nvl(sum((HighRange - LowRange)*TaxPct)
           OVER (order by TaxPct
                rows between unbounded preceding and 1 preceding),0) TaxBase,
           TaxPct
     from  (
            select bracket LowRange,
                  LEAD(bracket,1,9999999990) OVER (ORDER BY bracket) HighRange,
                  tax_rate TaxPct
              from tax_bracket2
            )
      order by TaxPct
        )
   where Income > LowRange
     and Income <= HighRange
/

Packaging this as a function is an excerise left to the reader.