R: rolling self-joins on subsets in data.table -


here mwe of problem.

data:

library(data.table) #dates in %y-%m-%d         df <- data.table(date=as.date(c("2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02", "2001-01-02")), dtm=c(18l,  18l, 18l,  18l,  18l, 18l,46l,46l,74l, 74l,74l,74l,165l, 165l,165l,165l), cval=c(1275l, 1300l, 1300l, 1320l, 1325l, 1325l, 1300l, 1300l, 1300l, 1300l, 1325l, 1325l, 1300l, 1300l, 1325l, 1325l), price_in=c(24.125, 24.625, 35.750, 16.250, 14.500, 50.250, 43.625, 49.125, 58.250, 58.250, 45.375, 70.125, 90.750, 74.750, 77.875, 85.500), price_out=c(26.125, 26.625, 36.625, 17.500, 15.500, 52.250, 45.625, 51.125, 60.000, 60.250, 47.375, 72.125, 92.750, 76.750, 79.875, 87.500), type=c("p", "c", "p", "c", "c", "p", "c", "p", "c", "p", "c", "p", "c", "p", "c", "p"))     df           date dtm cval price_in price_out type  1: 2001-01-02  18 1275   24.125    26.125    p  2: 2001-01-02  18 1300   24.625    26.625    c  3: 2001-01-02  18 1300   35.750    36.625    p  4: 2001-01-02  18 1320   16.250    17.500    c  5: 2001-01-02  18 1325   14.500    15.500    c  6: 2001-01-02  18 1325   50.250    52.250    p  7: 2001-01-02  46 1300   43.625    45.625    c  8: 2001-01-02  46 1300   49.125    51.125    p  9: 2001-01-02  74 1300   58.250    60.000    c 10: 2001-01-02  74 1300   58.250    60.250    p 11: 2001-01-02  74 1325   45.375    47.375    c 12: 2001-01-02  74 1325   70.125    72.125    p 13: 2001-01-02 165 1300   90.750    92.750    c 14: 2001-01-02 165 1300   74.750    76.750    p 15: 2001-01-02 165 1325   77.875    79.875    c 16: 2001-01-02 165 1325   85.500    87.500    p 

what want do:

  1. for each date, want obtain items, divided in type p's , c's have same dtm larger cval. second item in example data set, be:

date dtm cval price_in price_out type 2001-01-02 18 1300 24.625 26.625 c #the item 2001-01-02 18 1320 16.250 17.500 c #same dtm, higher cval 2001-01-02 18 1325 14.500 15.500 c #same dtm, higher cval

  1. now, let cval1 cval of current item, i.e. here cval1 = 1300 , cval2 larger cval's of items in subset, i.e. here cval2 = c(1320l, 1325l). then, want apply custom exclusion function, example let's price_in[cval %in% cval2]-price_out[cval==cval1]-0.5*(cval1-cval2) < 0
  2. i want exclude item pairs exclusion function returned true.

similarly (same procedure, different exclusion criteria) applies p items.

expected output: original data.table, df, minus rows excluded in procedure described above. example, using example function above evaluating items 2 , 4 returns true: 16.25-26.625-0.5*(1300-1320) = -0.375 < 0. thus, expected output df without rows 2 , 4 (note pair 2 , 5 not return true: 14.5-26.625-0.5*(1300-1325) = 0.375 >= 0, hence 5 not excluded):

          date dtm cval price_in price_out type  1: 2001-01-02  18 1275   24.125    26.125    p  3: 2001-01-02  18 1300   35.750    36.625    p  5: 2001-01-02  18 1325   14.500    15.500    c  6: 2001-01-02  18 1325   50.250    52.250    p  7: 2001-01-02  46 1300   43.625    45.625    c  8: 2001-01-02  46 1300   49.125    51.125    p           ...      ...       ...       ...      

and on. obviously, in case of items 7 , 8, if there no other item same characteristics (same date, dtm , type), cannot excluded.

what have tried far:

  1. i have created id each item, i.e. df[,id:=seq_along(date)], iterated via for loop through dates , used vectors check custom functions. if result vector contained true's, removed corresponding indices data.table. clearly, approach works runs forever given size of data.
  2. i experimenting data.table methods, because of extreme speed advantages. if understand correctly, want many rolling self-joins each date/dtm subset, along lines "df[df,roll=inf,by=.(date,dtm)]" (since total, rolling self-join, believe, not applicable in case). don't quite work.

question: there way of implementing exclusion procedure via data.table methods? possibly (but not necessarily) via multiple rolling self-joins?

any highly appreciated!


Comments