I have this dataset:
df <- data.frame(ID = c("m1","m2","m3","m4","m5","m6","m2","m3","m5","m6","m1","m4","m5"),
Year = c(1,1,1,1,1,1,2,2,2,2,3,3,3))
and I want to check whether each ID also appeared in the previous year. I currently have code that seems to work:
df$Check <- apply(df, 1, function(x) x["ID"] %in% df[df$Year == (as.numeric(x["Year"]) - 1), "ID"])
But my dataset has 3 million rows, and this takes far too long to run. Is there a better alternative?
Answer 0: (score: 4)
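Try
library(dplyr)
# Split the IDs by year, then test each year's IDs against the previous year's.
# Assumes df is sorted by Year and that the years present are consecutive.
dfs <- split(df$ID, df$Year)
df$check <- unlist(mapply(`%in%`, dfs, lag(dfs)))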
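The split-and-lag code above depends on row order and on the years being consecutive. As a minimal base R sketch that avoids both assumptions, one could build a combined ID/year key and look up the key for the year before. The helper `make_key` and the column name `check_key` are hypothetical names introduced here for illustration, and the `"_"` separator assumes that `Year` values never contain an underscore.

```r
df <- data.frame(
  ID   = c("m1","m2","m3","m4","m5","m6","m2","m3","m5","m6","m1","m4","m5"),
  Year = c(1,1,1,1,1,1,2,2,2,2,3,3,3)
)

# Hypothetical helper: one string key per (ID, Year) pair
make_key <- function(id, year) paste(id, year, sep = "_")

# A row is TRUE when the key for (ID, Year - 1) exists somewhere in the data
df$check_key <- make_key(df$ID, df$Year - 1) %in% make_key(df$ID, df$Year)
```

Because `%in%` is hash-based and fully vectorised, this does a single pass over the keys instead of one subset per row, which is where the `apply` version in the question spends its time.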
Answer 1: (score: 1)
k <- length(unique(df$Year))  # number of distinct years in the data
q <- unique(df$Year)          # the years present, in order of appearance
func <- function(x) {
  kk <- df$ID[df$Year == q[x]]        # IDs present in the current year
  kk %in% df$ID[df$Year == q[x - 1]]  # were they also present in the previous year?
}
# Rows of the first year have no previous year, so they are all FALSE
x <- sum(df$Year == unique(df$Year)[1])
df$check <- c(rep(FALSE, x), unlist(lapply(2:k, func)))
Answer 2: (score: 1)
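You can use ave: for each ID, compute the difference (diff) between the current Year and the previous Year in the data, padded with a leading zero. Check whether the result is 1 to create a logical vector: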
df$check2 <- with(df, ave(Year, ID, FUN = function(x) c(0, diff(x))) == 1)
# ID Year check check2
# 1 m1 1 FALSE FALSE
# 2 m2 1 FALSE FALSE
# 3 m3 1 FALSE FALSE
# 4 m4 1 FALSE FALSE
# 5 m5 1 FALSE FALSE
# 6 m6 1 FALSE FALSE
# 7 m2 2 TRUE TRUE
# 8 m3 2 TRUE TRUE
# 9 m5 2 TRUE TRUE
# 10 m6 2 TRUE TRUE
# 11 m1 3 FALSE FALSE
# 12 m4 3 FALSE FALSE
# 13 m5 3 TRUE TRUE
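To see what the function passed to ave does inside a single ID group, here is a small sketch using the years of m5 and m1 from the example data. The variable names are hypothetical, and the sketch assumes the rows of each ID are in ascending Year order, as they are in the example.

```r
# Years in which each ID appears, taken from the example data
years_m5 <- c(1, 2, 3)   # m5 appears every year
years_m1 <- c(1, 3)      # m1 skips year 2

# Gap to the previous appearance, padded with a leading zero
gap_m5 <- c(0, diff(years_m5))
gap_m1 <- c(0, diff(years_m1))

# A gap of exactly 1 means the ID was present in the previous year
check_m5 <- gap_m5 == 1
check_m1 <- gap_m1 == 1
```

The first appearance of an ID always gets the padded 0 and is therefore FALSE, and a gap of 2 (m1 in year 3) is FALSE as well, which matches rows 1, 5, 9, 11 and 13 of the output above.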
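Similarly with data.table: for each ID (by = ID), create a new variable Check2 that checks whether the difference between the current Year and the previous Year in the data is 1 (diff(Year) == 1), i.e. whether the previous year in the data is the immediately preceding year.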
library(data.table)
setDT(df)[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]
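Edit following a comment from the OP: if there are "multiple entries within the same ID", perform the calculation on the data with duplicate rows removed (unique), then join the result back to the original data.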
df2 <- unique(df)
df2[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]
df[df2, on = c("ID", "Year")]
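The reason duplicates matter can be shown on a toy data set. This is a sketch in base R rather than data.table, using ave and match in place of the join; the objects `toy`, `toy_unique` and `key` are hypothetical names, and the `"_"` separator assumes `Year` values contain no underscore.

```r
# Toy data: m1 has two rows in year 2
toy <- data.frame(ID = c("m1", "m1", "m1", "m2"), Year = c(1, 2, 2, 2))

prev_year_gap <- function(x) c(0, diff(x))

# Without de-duplication the second m1/year-2 row sees a gap of 0
naive <- with(toy, ave(Year, ID, FUN = prev_year_gap)) == 1

# Compute on unique (ID, Year) rows instead
toy_unique <- unique(toy)
toy_unique$check <- with(toy_unique, ave(Year, ID, FUN = prev_year_gap)) == 1

# Map the result back to every original row, keeping the original row order
key <- function(d) paste(d$ID, d$Year, sep = "_")
toy$check <- toy_unique$check[match(key(toy), key(toy_unique))]
```

Here `naive` marks the duplicated m1 row as FALSE even though m1 was present in year 1, while `toy$check` gives both year-2 rows of m1 the same TRUE.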