45565

R - multiple criteria search

<h3>Question</h3>

I have the following issue and I hope you may help me: I have a huge database (which I cannot disclose) but is it is structured as follows:

<ul><li>5 million observations</li> <li>7 variables of which three of interest in this case:</li> <li>Code ID Buy</li> <li>Code ID Sell</li> <li>Date</li> </ul>

I would like another variable called new, which takes the value 0 in line i if:

<ul><li>there exists an observation k, with Code_IB_Buy_[i]=Code_IB_Buy_[k] and Code_IB_Sell_[i]=Code_IB_Sell_[k] and Date[i] is after Date[k] if not, I would like new[i]=1.</li> </ul>

Basically if there was no transaction before between the buyer and the seller, this is the first occurrence in line i so the variable new takes the value 1.

What I have tried until now, is to sort the database (as a data.table) by date and by buyer. Then I use a For Loop, which would work in other languages but here it does not for some reason.

EDIT: A reduced sample would look like this:

library(data.table) set.seed(1) Data <- data.frame( Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5), Amount <- rnorm(21,mean=100,sd=20), Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"), Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A"), Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2), LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A") ) Data$new<-0 setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := { r <- rowid(Code_ID_Buy, Code_ID_Sell) +(r==1L) }] Data[Month==1L, new:=0L] View(Data)

The data is already ordered since I used (and it works fine) : Data<-Data[order(month),] Data<-Data[order(Code_ID_Buy),]

However, what I would like to do now is to introduce the new column "new" that takes value 1 if and if only R does not find any observation where ID_Buy and ID_Sell are the same before the given date (new customer relationship)

In the example above, line 7,14 and 19 should be marked with 1 but none of the others.

In a second step, I would like to calculate the number of months between the first and second relationship (a column "distance", which would take the value "month of new relationship"-"month of last new relationship" else, that is in line 14, it would take the value "2",5-3, but this is getting ahead of myself).

Is this sufficient information ? Thank you a lot already for your time.

EDIT : @chinsoon12 thank you a lot, you are really good ! this seem to work like that(I actually didn't want NA when new = 0, so I changed your formula by removing new==1L and it does not work because the distance for all lines with new=0 is 1 because it calculates the time not with the last new == 1L but the distance with the last new==0L)

I have also however a small question : I used

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last", by=.EACHI, i.Month - x.Month]$V1], but is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand) for each i?
<h3>Answer1:</h3>

An option is to order by Month, Code_ID_Buy and Code_ID_Sell and then set the first row in each group of Code_ID_Buy and Code_ID_Sell to 1:

library(data.table) setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := +(rowid(Code_ID_Buy, Code_ID_Sell) == 1L)]

Your requirement basically translates into finding the first combi of Code_ID_Buy and Code_ID_Sell, here is another option using duplicated:

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := !duplicated(.SD), .SDcols=c("Code_ID_Buy", "Code_ID_Sell")]

output:

Month Amount Code_ID_Buy Code_ID_Sell new 1: 1 87.47092 100D 98C 1 2: 1 103.67287 100D 99C 1 3: 2 83.28743 100D 98C 0 4: 2 131.90562 100D 99C 0 5: 3 106.59016 100D 98C 0 6: 3 83.59063 100D 99C 0 7: 3 109.74858 100D 96V 1 8: 4 114.76649 100D 98C 0 9: 4 111.51563 100D 99C 0 10: 4 93.89223 100D 96V 0 11: 5 130.23562 100D 98C 0 12: 5 107.79686 100D 99C 0 13: 5 87.57519 100D 96V 0 14: 5 55.70600 100D 94D 1 15: 6 122.49862 100D 98C 0 16: 6 99.10133 100D 99C 0 17: 6 99.67619 100D 96V 0 18: 6 118.87672 100D 94D 0 19: 3 116.42442 102D 25A 1 20: 4 111.87803 102D 25A 0 21: 5 118.37955 102D 25A 0 Month Amount Code_ID_Buy Code_ID_Sell new

This solution has 1 for the first few rows as well since there are no rows prior to them with the same Code_ID_Buy or Code_ID_Sell. If it is necessary to remove them, you can use Data[Month==1L, new := 0L]

<hr />

For the 2nd question, you can use a non-equi join to find rows where Month is before current one and new=1L for the same Code_ID_Buy:

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last", by=.EACHI, i.Month - x.Month]$V1]

output:

Month Amount Code_ID_Buy Code_ID_Sell new distance 1: 1 87.47092 100D 98C 1 NA 2: 1 103.67287 100D 99C 1 NA 3: 2 83.28743 100D 98C 0 NA 4: 2 131.90562 100D 99C 0 NA 5: 3 106.59016 100D 98C 0 NA 6: 3 83.59063 100D 99C 0 NA 7: 3 109.74858 100D 96V 1 2 8: 4 114.76649 100D 98C 0 NA 9: 4 111.51563 100D 99C 0 NA 10: 4 93.89223 100D 96V 0 NA 11: 5 130.23562 100D 98C 0 NA 12: 5 107.79686 100D 99C 0 NA 13: 5 87.57519 100D 96V 0 NA 14: 5 55.70600 100D 94D 1 2 15: 6 122.49862 100D 98C 0 NA 16: 6 99.10133 100D 99C 0 NA 17: 6 99.67619 100D 96V 0 NA 18: 6 118.87672 100D 94D 0 NA 19: 3 116.42442 102D 25A 1 NA 20: 4 111.87803 102D 25A 0 NA 21: 5 118.37955 102D 25A 0 NA Month Amount Code_ID_Buy Code_ID_Sell new distance <hr />

for your 3rd qn, you can try:

Data[, dlr := if (k>0L) rleid(Month) - 1L, .(Code_ID_Buy, k=cumsum(new))]

output:

Month Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr 1: 1 87.47092 100D 98C NA 98C 0 NA NA 2: 1 103.67287 100D 99C NA 98C 0 NA NA 3: 2 83.28743 100D 98C NA 98C 0 NA NA 4: 2 131.90562 100D 99C NA 98C 0 NA NA 5: 3 106.59016 100D 98C NA 98C 0 NA NA 6: 3 83.59063 100D 99C NA 98C 0 NA NA 7: 3 109.74858 100D 96V 0 98C 1 NA 0 8: 4 114.76649 100D 98C 1 96V 0 NA 1 9: 4 111.51563 100D 99C 1 96V 0 NA 1 10: 4 93.89223 100D 96V 1 96V 0 NA 1 11: 5 130.23562 100D 98C 2 96V 0 NA 2 12: 5 107.79686 100D 99C 2 96V 0 NA 2 13: 5 87.57519 100D 96V 2 96V 0 NA 2 14: 5 55.70600 100D 94D 2 96V 1 2 0 15: 6 122.49862 100D 98C 1 94D 0 NA 1 16: 6 99.10133 100D 99C 1 94D 0 NA 1 17: 6 99.67619 100D 96V 1 94D 0 NA 1 18: 6 118.87672 100D 94D 1 94D 0 NA 1 19: 3 116.42442 102D 25A 0 <NA> 1 NA 0 20: 4 111.87803 102D 25A 1 25A 0 NA 1 21: 5 118.37955 102D 25A 2 25A 0 NA 2 Month Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr

p.s.: <em>Welcome to SO. Just to let you know that it might be good ettiquete to post a new question rather than add a new question to a previous question and in addition, you might get others to respond as well.</em>

来源:https://stackoverflow.com/questions/60441583/r-multiple-criteria-search

Recommend