Category Archives: Connect_By

Resolving chains of events with CONNECT_BY

Imagine you have the following data in a table:

event_id    event_date
--------    -----------
A             01.01.2009
B             12.01.2009
C             15.02.2009
D             30.03.2009

You need to identify whether there is at least a single case in which the events with ids A,B,C have happened in this same order. The order is determined by the event_date.

It doesn’t matter how many times each event has happened. It doesn’t matter if there are other events (e.g. with id D) in between. It only matters that A,B,C have happened (or not) at least once in the given order.

E.g. the following three chains satisfy the condition. The events are ordered by event_date. I ommitted the event_date and presented the whole chain “flat” in a single row to make the view more compact:

Chain 1: A,A,D,B,C
Chain 2: C,B,A,D,B,D,B,C,D
Chain 3: A,B,B,B,B,A,C

I can think of at least 3 ways to solve this in Oracle. There are more, for sure – it would be great if you could post something else. But here are the three that come to my mind:

1. PL/SQL function with recursive call – suitable if the chain is not predefined – e.g. it will be supplied with a parameter.

2. Analytic functions – I could imagine ROWS BETWEEN current row AND unbounded following could be used to find out whether e.g. B has (at least once) happend after A. I haven’t tested it, though – could be a topic of another post.

3. CONNECT_BY. This is what the current post is about.

The first time I’ve come to the idea of using connect_by to resolve a chain of events was more than a year ago reading this post by David Aldridge (it is actually in the comment part, comment 2). I registered the idea somewhere in the back of my head and now when I have to solve a similar (although far more simple than David’s) problem it came back to me.

A solution with CONNECT_BY could be slow for big volume of data. So please test for performance – with a data volume comparable to the real one. Here is a very good detailed explanation by Christian Antognini of how CONNECT_BY is executed internally by Oracle.

I think of the following solution:

select chain_of_event_dates
from
    -- sys_connect_by_path gives us convenient way to "flatten" the chain
    (
    select sys_connect_by_path(event_nr,'/') as chain_of_event_nr,
           sys_connect_by_path(to_char(event_date,'dd.mm.yyyy'),'/') as chain_of_event_dates
    from
        -- intermediate layer to isolate event_id from its sequence nr in the chain
        -- the chain we look for is now effectively "1/2/3"
        -- we use this in the connect_by
        (
        select
            case when event_id='A' then 1
                   when event_id='B' then 2
                   when event_id='C' then 3
                   else null
            end as event_nr,
            event_id,
            event_date
        from t
        where chain_id=&&1
        )
    start with event_nr=1
    connect by ( prior event_nr+1 = event_nr
                      and
                      prior event_date < event_date
                    )
    )
where chain_of_event_nr='/1/2/3'
and rownum<=1;

In the statement above I do the following:

1. I first use CASE to bring the concrete chain to a general case where the events are plain sequential numbers. After that the chain becomes effectively “1/2/3”.

Please not that I included chain_id in the where clause. I’ve done this for two reasons: (1) it gives me an easy method to test (see the test at the end) and (2) in a real sitution we would normaly have some wehere condition

2. Then I define in “conect by” terms the conditions which mean “chain discovered”:

— start the evaluation with the first event in our defined chain (ignore all other events when chosing the starting point)
start with event_nr=1

— in our generalized chain we could use simple math:
— “next event” = “previous event” + 1
— AND “next event” must have happend on a later date (otheweise it wouldn’t be “next”)
connect by ( prior event_nr+1 = event_nr
and
prior event_date < event_date
)

3. In the end I "flatten" the chain with sys_connect_by_path to alow for an easy check – it must then match "1/2/3".

I created the following data to check the solution:

drop table t;

create table t(
chain_id number,
event_id varchar2(10),
event_date date);

insert into t values(1,'A',to_date('1.1.2001','dd.mm.yyyy'));
insert into t values(1,'A',to_date('2.2.2002','dd.mm.yyyy'));
insert into t values(1,'D',to_date('3.3.2003','dd.mm.yyyy'));
insert into t values(1,'B',to_date('4.4.2004','dd.mm.yyyy'));
insert into t values(1,'C',to_date('5.5.2005','dd.mm.yyyy'));

insert into t values(2,'C',to_date('1.1.2001','dd.mm.yyyy'));
insert into t values(2,'B',to_date('2.2.2002','dd.mm.yyyy'));
insert into t values(2,'A',to_date('3.3.2003','dd.mm.yyyy'));
insert into t values(2,'D',to_date('4.4.2004','dd.mm.yyyy'));
insert into t values(2,'B',to_date('5.5.2005','dd.mm.yyyy'));
insert into t values(2,'D',to_date('6.6.2006','dd.mm.yyyy'));
insert into t values(2,'B',to_date('7.7.2007','dd.mm.yyyy'));
insert into t values(2,'C',to_date('8.8.2008','dd.mm.yyyy'));
insert into t values(2,'D',to_date('9.9.2009','dd.mm.yyyy'));

insert into t values(3,'A',to_date('1.1.2001','dd.mm.yyyy'));
insert into t values(3,'B',to_date('2.2.2002','dd.mm.yyyy'));
insert into t values(3,'B',to_date('3.3.2003','dd.mm.yyyy'));
insert into t values(3,'B',to_date('4.4.2004','dd.mm.yyyy'));
insert into t values(3,'B',to_date('5.5.2005','dd.mm.yyyy'));
insert into t values(3,'A',to_date('6.6.2006','dd.mm.yyyy'));
insert into t values(3,'C',to_date('7.7.2007','dd.mm.yyyy'));

insert into t values(4,'C',to_date('1.1.2001','dd.mm.yyyy'));
insert into t values(4,'B',to_date('2.2.2002','dd.mm.yyyy'));
insert into t values(4,'B',to_date('3.3.2003','dd.mm.yyyy'));
insert into t values(4,'A',to_date('4.4.2004','dd.mm.yyyy'));
insert into t values(4,'C',to_date('5.5.2005','dd.mm.yyyy'));

insert into t values(5,'A',to_date('1.1.2001','dd.mm.yyyy'));
insert into t values(5,'C',to_date('2.2.2002','dd.mm.yyyy'));
insert into t values(5,'B',to_date('3.3.2003','dd.mm.yyyy'));
insert into t values(5,'A',to_date('4.4.2004','dd.mm.yyyy'));
insert into t values(5,'D',to_date('5.5.2005','dd.mm.yyyy'));

commit;

In the given data, chains with id 1,2,3 have to be reported as “valid” according to our definition. Chains with id 4,5 are “not vlid”. I’ve put the SQL from above in a script and run it for each of the 5 chains.

SQL> @D:\botev\blogs\chain_connect_by_select.sql 1
alt 21: where chain_id=&&1
neu 21: where chain_id=1

CHAIN_OF_EVENT_DATES
---------------------------------------------------
/01.01.2001/04.04.2004/05.05.2005

SQL> @D:\botev\blogs\chain_connect_by_select.sql 2
alt 21: where chain_id=&&1
neu 21: where chain_id=2

CHAIN_OF_EVENT_DATES
---------------------------------------------------
/03.03.2003/05.05.2005/08.08.2008

SQL> @D:\botev\blogs\chain_connect_by_select.sql 3
alt 21: where chain_id=&&1
neu 21: where chain_id=3

CHAIN_OF_EVENT_DATES
---------------------------------------------------
/01.01.2001/02.02.2002/07.07.2007

SQL> @D:\botev\blogs\chain_connect_by_select.sql 4
alt 21: where chain_id=&&1
neu 21: where chain_id=4

Es wurden keine Zeilen ausgewählt (0 rows selected in German)

SQL> @D:\botev\blogs\chain_connect_by_select.sql 5
alt 21: where chain_id=&&1
neu 21: where chain_id=5

Es wurden keine Zeilen ausgewählt (0 rows selected in German)

Advertisements