Tuesday, December 31, 2013

Analytic dilemma: row_number vs rank vs dense_rank

row_number(), rank(), dense_rank()

So what are the differences between these functions, and what is each one used for?

create table bank_customers (
id number,
full_name varchar2(100),
birth_date date);

create table current_account (
customer_id number,
action varchar2(100),
amount number,
action_date date);
----------------------------------------
Table created.
Table created.

insert into bank_customers values (1, 'Tal Avinari', sysdate);
insert into bank_customers values (2, 'Daniel Kagan', sysdate);
insert into bank_customers values (3, 'Ben David', sysdate);
insert into bank_customers values (4, 'Alex Rubin', sysdate);
insert into current_account
values (1, 'deposit', 100, to_date('01/01/2013', 'dd/mm/yyyy'));
insert into current_account
values (1, 'withdrawal', 230, to_date('01/01/2013', 'dd/mm/yyyy'));
insert into current_account
values (1, 'withdrawal', 100, to_date('02/01/2012', 'dd/mm/yyyy'));
insert into current_account
values (2, 'withdrawal', 1000, to_date('03/01/2012', 'dd/mm/yyyy'));
insert into current_account
values (2, 'deposit', 1060, to_date('25/05/2012', 'dd/mm/yyyy'));
insert into current_account
values (3, 'withdrawal', 560, to_date('25/01/2013', 'dd/mm/yyyy'));
insert into current_account
values (1, 'deposit', 670, to_date('01/01/2012', 'dd/mm/yyyy'));
insert into current_account
values (1, 'deposit', 100, to_date('01/01/2012', 'dd/mm/yyyy'));


select b.full_name, a.action_date, a.amount, a.action
from current_account a,
bank_customers b
where a.customer_id = b.id;

FULL_NAME            ACTION_DATE    AMOUNT ACTION
-------------------- ----------- --------- --------------------
Tal Avinari          01-JAN-12         100 deposit
Tal Avinari          01-JAN-13         100 deposit
Tal Avinari          01-JAN-13         230 withdrawal
Tal Avinari          02-JAN-12         100 withdrawal
Daniel Kagan         03-JAN-12        1000 withdrawal
Daniel Kagan         25-MAY-12        1060 deposit
Ben David            25-JAN-13         560 withdrawal
Tal Avinari          01-JAN-12         670 deposit

8 rows selected.

Let’s start with a little motivation: suppose we want to show each bank customer and their most recent account action.
The analytic function row_number() is a good candidate; let’s see how it works:

select row_number() over (partition by b.id order by a.action_date desc) as row_number,
       b.full_name,
       a.action_date,
       a.amount
from current_account a,
     bank_customers b
where a.customer_id = b.id;
ROW_NUMBER FULL_NAME            ACTION_DATE    AMOUNT
---------- -------------------- ----------- ---------
         1 Tal Avinari          01-JAN-13         230
         2 Tal Avinari          01-JAN-13         100
         3 Tal Avinari          02-JAN-12         100
         4 Tal Avinari          01-JAN-12         670
         5 Tal Avinari          01-JAN-12         100
         1 Daniel Kagan         25-MAY-12        1060
         2 Daniel Kagan         03-JAN-12        1000
         1 Ben David            25-JAN-13         560

8 rows selected.

So the function partitions the rows by customer id and numbers them one by one according to the order by clause. The final step is to filter on row_number = 1 and we’re done!
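Since an analytic function can't appear directly in a WHERE clause, that filtering step can be sketched by wrapping the query in an inline view:

```sql
select full_name, action_date, amount
from (
  select b.full_name,
         a.action_date,
         a.amount,
         row_number() over (partition by b.id order by a.action_date desc) as rn
  from current_account a,
       bank_customers b
  where a.customer_id = b.id
)
where rn = 1;
```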

Unfortunately, in our case Tal has two account actions on the same date, so if we keep only the first row_number we’ll lose information.
rank() is the answer for that!
Let’s have a look:

select rank() over (partition by b.id order by a.action_date desc) as rank,
b.full_name, 
a.action_date,
a.amount
from current_account a,
bank_customers b
where a.customer_id = b.id;


      RANK FULL_NAME            ACTION_DATE    AMOUNT
---------- -------------------- ----------- ---------
         1 Tal Avinari          01-JAN-13         230
         1 Tal Avinari          01-JAN-13         100
         3 Tal Avinari          02-JAN-12         100
         4 Tal Avinari          01-JAN-12         670
         4 Tal Avinari          01-JAN-12         100
         1 Daniel Kagan         25-MAY-12        1060
         2 Daniel Kagan         03-JAN-12        1000
         1 Ben David            25-JAN-13         560

8 rows selected.

Notice that after the two rows with rank = 1, the next rank is 3.
If we don’t want this gap, we use the dense_rank() function.
  
select dense_rank() over (partition by b.id order by a.action_date desc) as dense_rank,
b.full_name, 
a.action_date,
a.amount
from current_account a,
bank_customers b
where a.customer_id = b.id;

DENSE_RANK FULL_NAME            ACTION_DATE    AMOUNT
---------- -------------------- ----------- ---------
         1 Tal Avinari          01-JAN-13         230
         1 Tal Avinari          01-JAN-13         100
         2 Tal Avinari          02-JAN-12         100
         3 Tal Avinari          01-JAN-12         670
         3 Tal Avinari          01-JAN-12         100
         1 Daniel Kagan         25-MAY-12        1060
         2 Daniel Kagan         03-JAN-12        1000
         1 Ben David            25-JAN-13         560
8 rows selected.
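Putting it together, the original task (each customer's latest actions, keeping ties) can be sketched by filtering on rank() = 1:

```sql
select full_name, action_date, amount, action
from (
  select b.full_name,
         a.action_date,
         a.amount,
         a.action,
         rank() over (partition by b.id order by a.action_date desc) as rnk
  from current_account a,
       bank_customers b
  where a.customer_id = b.id
)
where rnk = 1;
```

Unlike the row_number version, this returns both of Tal's actions from 01-JAN-13.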


Hope you enjoyed it.
Happy new year !

  

Nice little SQL tricks #1

Here are some nice little SQL tricks:


1) Filter or not

Sometimes you want to get an input from the user and use that value for filtering your query.

Let's say:

SELECT count(1) 
FROM emp
WHERE dept_id = :input;

And that's fine as long as the user is obligated to choose a specific department.
But we want to give him the option of selecting 'All departments' in the same query.

The best way to do so (without using "or" or "union all") is:

SELECT count(1) 
FROM emp
WHERE nvl(:input, dept_id) = dept_id;

The application sends NULL whenever 'All departments' is chosen. (One caveat: rows whose dept_id is itself NULL will never match, even when :input is NULL, because NULL = NULL is not true in SQL.)

2) Dynamic table

If you want to dynamically create a table that contains the numbers from 1 to 50 (or up to any other natural number), try this one:

SELECT level as num
FROM dual
CONNECT BY level <= 50;

This trick comes in handy a lot in data migration, when one row needs to be split into several rows (just make a Cartesian product between the original table and the dynamic one).
For example, the following query produces 3 rows for each employee in the table:

SELECT emp_id, num
FROM emp, (select level as num from dual connect by level < 4);


3) Paging mechanism

If you need to fetch only a specific slice of rows (for example, if your task involves paging),
you can use the row_number analytic function, which orders the result by the column(s) of your choice and gives each row a unique number in that order.


In the following example, we show 10 employees, ordered by their hire date, starting with the 20th employee:

SELECT * FROM
(
      SELECT e.*, row_number() over (order by date_hire desc) as rwnm
      FROM emp e
)
WHERE rwnm BETWEEN 20 AND 29;


More tricks soon!












Monday, December 30, 2013

Tutorial: How to return a table

Oracle lets you create a function that returns a table.
This can be useful in some situations like creating a view that receives arguments or creating a dynamic table.

In this tutorial, I'll show you how to create a function that returns the Fibonacci series.

-If you have an idea how to do that without functions and/or other tables, please comment-


First, create the type of row that we want to return:

CREATE TYPE t_fibunacci_row AS OBJECT 
(
 id           NUMBER,
 value        NUMBER
);
/

Create an array of the previous type:

CREATE TYPE t_fibunacci_array IS TABLE OF t_fibunacci_row;
/


Now the function will look like this (Don't forget to use the PIPELINED clause):

CREATE OR REPLACE FUNCTION make_fibunacci (p_limit IN NUMBER) 
                  RETURN t_fibunacci_array PIPELINED IS

num_a number;
num_b number;

BEGIN

num_a := 0;
num_b := 1;

  FOR i IN 1 .. p_limit LOOP

    -- This is the important command
    PIPE ROW(t_fibunacci_row(i, num_a));   
    
  num_b := num_a + num_b;
  num_a := num_b - num_a;
    
  END LOOP;

  RETURN;
END;
/

The PIPE ROW command returns a t_fibunacci_row each time the outer query that uses this function needs the next row.
Using PIPE ROW guarantees that the function runs in parallel with the main query
and doesn't wait for every object in the result to be calculated:
whenever a record is "ready", it is piped to the main query.

Using this function is easy:

SELECT * FROM table(make_fibunacci(10));

ID  VALUE
==  =====
 1      0
 2      1
 3      1
 4      2
 5      3
 6      5
 7      8
 8     13
 9     21
10     34


This function acts like a view in most situations, so you can easily join it with other tables, create a view on top of it, and more.
You cannot index this function, however, just as you cannot index a view.
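For example, a view built on top of the function (the view name here is made up for illustration):

```sql
-- Hypothetical view name, for illustration only
CREATE OR REPLACE VIEW first_ten_fibunacci AS
SELECT id, value
FROM table(make_fibunacci(10));

-- ...and then query or join it like any other view:
SELECT f.value
FROM first_ten_fibunacci f
WHERE f.id <= 5;
```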

How to simulate a view that receives arguments

Every DB developer knows this situation.
You create a general view that combines a bunch of tables and manipulates the data, but doesn't filter for a specific business object.
In that case you can't just select * from this view (because the tables are huge); you have to use a where clause:

select * from my_view a
where a.dept_id = :var;

And that works just fine!

This behaves like a view that receives a dept_id argument, because the "predicate pushing" mechanism first pushes the new filter into the view, and selects by it.

Let's see what happens if we run this query, this time with more than 1 object for filtering:

    select * from my_view a
    where a.dept_id in (select dept_id from depts where manager_id is null);

In this case, there is a high chance that the view's original explain plan will get "messed up", and in most cases that's not a bad thing.

But sometimes the optimizer is not as smart as you, and won't realize that the IN clause returns only a few records, so it would be better off handling the query as if it were written like this:

select * from my_view a
where a.dept_id = :var1
union all
select * from my_view a
where a.dept_id = :var2
union all
select * from my_view a
where a.dept_id = :var3;

That way it would use the original explain plan 3 times (or more), instead of building a new one.

In other words, how can we simulate a view that receives an argument?

It's easy, and it has worked for me many times.
Just use a function that returns a table!
The function receives an argument, executes the view's query with that argument, and returns the result as a table.
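A minimal sketch of such a function (all names here are hypothetical, and it assumes object/table types t_my_view_row / t_my_view_array have been created to match my_view's columns, as in the previous tutorial):

```sql
CREATE OR REPLACE FUNCTION my_view_func (p_dept_id IN NUMBER)
                  RETURN t_my_view_array PIPELINED IS
BEGIN
  -- The filter is applied inside the function, so the view's original
  -- single-department plan is reused on every call
  FOR rec IN (SELECT dept_id, emp_id  -- hypothetical columns
              FROM my_view
              WHERE dept_id = p_dept_id) LOOP
    PIPE ROW(t_my_view_row(rec.dept_id, rec.emp_id));
  END LOOP;
  RETURN;
END;
/
```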

So now, go ahead and run a query that calls the function once per matching department:

select f.*
from depts d,
     table(my_view_func(d.dept_id)) f
where d.manager_id is null;

And check performance.

Saturday, December 28, 2013

Lighter and faster indexes

Suppose you have a huge table, and that table contains a column with very low cardinality.
One of the values of that column represents most of the table (let's say 99%).

For example:

Create table missions
(
    id number,
    name varchar2(100),
    status number,
    <more columns>
);

select status, count(1) from missions
group by status;

status    count(1)
======  ==========
1            20000
2            30000
3         50000000

The application wants to select all the missions with status 1.
It's obvious that we need to create an index with 'status' as the leading column.

However, creating a normal index on (status, id, name) will index every row of the table, even though we will NEVER use this index to select status = 3. That's just a waste of storage, and a heavy index that can hurt performance.

The solution
Using a function-based index, and the fact that Oracle doesn't create index entries when all the key columns are NULL, we can create this nice index:

Create index missions_sts_1 on missions
(
      decode(status, 1, 1, NULL),
      decode(status, 1, id, NULL),
      decode(status, 1, name, NULL)
);

(and a very similar index for status=2).

This index will weigh about 1% of the original index, and that's a huge difference.

Just don't forget to use the full column expressions in your queries, exactly as they are written in the index definition:

select decode(status, 1, id, NULL)
from missions
where decode(status, 1, 1, NULL) = 1;

Good luck!