
Question:
I have my dataset such that
df <- data.frame(ID = c("m1","m2","m3","m4","m5","m6","m2","m3","m5","m6","m1","m4","m5"),
Year = c(1,1,1,1,1,1,2,2,2,2,3,3,3))
and want to perform a check whether the ID appears in the previous year. Now I have a code that seems to work
df$Check <- apply(df, 1, function(x) x["ID"] %in% df[df$Year == (as.numeric(x["Year"]) - 1), "ID"])
but given that my dataset is 3million rows long this function takes far too long to run. Is there a better alternative to this??
Answer1:Try
library(dplyr)
dfs <- split(df$ID, df$Year);
df$check <- unlist(mapply(`%in%`, dfs, lag(dfs)))
Answer2:You may use ave
: for each ID
, calculate the difference between current Year
and preceeding Year
(diff
). Pad with a leading zero. Check if the result is 1
to create a logical vector:
df$check2 <- with(df, ave(Year, ID, FUN = function(x) c(0, diff(x))) == 1)
# ID Year check check2
# 1 m1 1 FALSE FALSE
# 2 m2 1 FALSE FALSE
# 3 m3 1 FALSE FALSE
# 4 m4 1 FALSE FALSE
# 5 m5 1 FALSE FALSE
# 6 m6 1 FALSE FALSE
# 7 m2 2 TRUE TRUE
# 8 m3 2 TRUE TRUE
# 9 m5 2 TRUE TRUE
# 10 m6 2 TRUE TRUE
# 11 m1 3 FALSE FALSE
# 12 m4 3 FALSE FALSE
# 13 m5 3 TRUE TRUE
Similar with data.table
:
For each ID
(by = ID
), create the new variable check2
: check if the difference between current Year
and preceeding Year
in the data is 1
((diff(year) == 1
), i.e. if the preceeding year is the <em>previous</em> year.
library(data.table)
setDT(df)[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]
<hr />Edit following comment by OP. In case of "<em>multiple entries of the same ID in the same year</em>", you perform the calulation on data where duplicated rows are removed (unique
). Then join the result to the original data.
df2 <- unique(df)
df2[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]
df[df2, on = c("ID", "Year")]
Answer3: